Bayesian Data Cleaning for Web Data
Authors
Abstract
Data cleaning is a long-standing problem, which is growing in importance with the mass of uncurated web data. State-of-the-art approaches for handling inconsistent data are systems that learn and use conditional functional dependencies (CFDs) to rectify data. These methods learn data patterns (CFDs) from a clean sample of the data and use them to rectify the dirty/inconsistent data. While obtaining a clean training sample is feasible in enterprise data scenarios, it is infeasible in web databases, where there is no separate curated data. CFD-based methods are unfortunately particularly sensitive to noise; we will empirically demonstrate that the number of CFDs learned falls quite drastically with even a small amount of noise. To overcome this limitation, we propose a fully probabilistic framework for cleaning data. Our approach involves learning both the generative and error (corruption) models of the data and using them to clean the data. For generative models, we learn Bayesian networks from the data. For error models, we consider a maximum entropy framework for combining multiple error processes. The generative and error models are learned directly from the noisy data. We present the details of the framework and demonstrate its effectiveness in rectifying web data.
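The scoring idea behind such probabilistic cleaning can be sketched minimally: rank candidate repairs by prior(candidate) × likelihood(observed | candidate). This is a toy stand-in, not the paper's actual models: empirical value counts over the noisy column substitute for the learned Bayes network, and a simple edit-distance corruption model substitutes for the maximum-entropy error model; all names below (`clean_value`, `edit_distance`) are illustrative.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def clean_value(observed, candidates, prior_counts, error_prob=0.1):
    """Pick the candidate maximizing prior(candidate) * P(observed | candidate)."""
    total = sum(prior_counts.values())
    best, best_score = observed, -1.0
    for c in candidates:
        d = edit_distance(observed, c)
        # toy error model: each edit is an independent corruption event
        likelihood = (error_prob ** d) * ((1 - error_prob) ** max(len(c) - d, 0))
        score = (prior_counts[c] / total) * likelihood
        if score > best_score:
            best, best_score = c, score
    return best

counts = Counter({"phoenix": 50, "phenix": 2, "tucson": 30})
print(clean_value("phenix", counts, counts))  # -> phoenix
```

Note how the frequent value "phoenix" wins over the literal match "phenix": the prior outweighs the small corruption penalty, which is exactly why such models can be learned from noisy data directly.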
Similar resources
Data Cleaning of Medical Data for Knowledge Mining
Data mining or data analysis in biomedicine differs from other research fields, because biomedical data are heterogeneous and come from different sources. Data from different medical sources are voluminous, each source may have a different data structure or data schema, and the data quality also differs. Moreover, each physician may have his or her own interpretation with...
An Efficient Algorithm for Data Cleaning of Web Logs with Spider Navigation Removal
The World Wide Web is growing rapidly with the exponential growth of websites, providing users with vast amounts of information. Text files called web logs store a user's clicks whenever the user visits a website. Web usage mining is a stream of web mining that involves applying mining techniques to server logs containing user clickstreams....
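Spider (crawler) navigation removal of the kind this related paper addresses is commonly done with simple heuristics, such as filtering requests for robots.txt and matching known bot user-agent strings. A minimal sketch under an assumed log schema (the dict keys and `is_spider` name are illustrative, not from the paper):

```python
import re

# common heuristic patterns for crawler user-agents (illustrative list)
BOT_AGENTS = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)

def is_spider(entry):
    """entry: dict with 'path' and 'agent' keys (assumed log schema)."""
    return entry["path"] == "/robots.txt" or bool(BOT_AGENTS.search(entry["agent"]))

log = [
    {"ip": "1.2.3.4", "path": "/index.html", "agent": "Mozilla/5.0"},
    {"ip": "5.6.7.8", "path": "/robots.txt", "agent": "Googlebot/2.1"},
]
human_clicks = [e for e in log if not is_spider(e)]
print(len(human_clicks))  # -> 1
```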
Scalable Probabilistic Framework for Improving Data
Recent efforts in data cleaning of structured data have focused exclusively on problems like data deduplication, record matching, and data standardization; none of the approaches addressing these problems focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum cost repair of tuples that violate static constraints like CFDs (which hav...
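The constraint-violation detection underlying such minimum-cost repair can be sketched for the simplest case, a constant CFD: a functional dependency plus a pattern of required constants. The function below is a toy illustration of detection only (no repair); the names and the restriction to constant patterns are assumptions, not the cited paper's method.

```python
def cfd_violations(tuples, lhs, rhs, pattern):
    """Find tuples violating a constant CFD lhs -> rhs.

    pattern maps attributes to required constants; "_" means wildcard.
    Example CFD: [zip] -> [city] with pattern (zip=85281 => city=Tempe).
    """
    bad = []
    for t in tuples:
        # the tuple falls under the CFD if its LHS matches the pattern
        if all(pattern.get(a, "_") in ("_", t[a]) for a in lhs):
            want = pattern.get(rhs)
            if want not in (None, "_") and t[rhs] != want:
                bad.append(t)
    return bad

rows = [{"zip": "85281", "city": "Tempe"}, {"zip": "85281", "city": "Tucson"}]
print(cfd_violations(rows, ["zip"], "city", {"zip": "85281", "city": "Tempe"}))
```

A minimum-cost repair would then change the cheapest set of attribute values so that no violations remain; here the second tuple's city would be rewritten.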
Quality Assurance of Government Databases
Data cleaning is a vital process that ensures the quality of data stored in real-world databases. The process of identifying the record pairs that represent the same entity (duplicate records), commonly known as record linkage, is one of the essential elements of data cleaning. Digital government serves as an emerging area for database research, such as database management, data integration, da...
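Record linkage, as described in this related abstract, is often bootstrapped with a pairwise string-similarity measure and a threshold. A minimal sketch using token-set Jaccard similarity (the measure, threshold, and function names are illustrative choices, not the paper's):

```python
def jaccard(a, b):
    """Jaccard similarity of the lowercase token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def find_duplicates(records, threshold=0.6):
    """Return index pairs whose similarity meets the threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs

names = ["John A Smith", "john smith", "Mary Jones"]
print(find_duplicates(names))  # -> [(0, 1)]
```

Production record-linkage systems avoid the quadratic pair comparison with blocking, but the scoring idea is the same.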
Probabilistic Models for Anomaly Detection in Remote Sensor Data Streams
Remote sensors are becoming the standard for observing and recording ecological data in the field. Such sensors can record data at fine temporal resolutions, and they can operate under extreme conditions prohibitive to human access. Unfortunately, sensor data streams exhibit many kinds of errors ranging from corrupt communications to partial or total sensor failures. This means that the raw dat...
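A common probabilistic baseline for such sensor-stream anomaly detection is to flag readings far from a running Gaussian estimate. The sketch below uses Welford's online mean/variance update; the threshold `k`, the `warmup` period, and the function name are illustrative assumptions, not the cited paper's model.

```python
import math

def flag_anomalies(stream, k=3.0, warmup=10):
    """Flag readings more than k running std devs from the running mean."""
    n, mean, m2 = 0, 0.0, 0.0
    flags = []
    for x in stream:
        if n >= warmup:
            std = math.sqrt(m2 / (n - 1))
            flags.append(abs(x - mean) > k * std)
        else:
            flags.append(False)  # not enough data to judge yet
        # Welford's online update of mean and sum of squared deviations
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
    return flags

readings = [10.0] * 15 + [100.0] + [10.0] * 4
print(flag_anomalies(readings).index(True))  # -> 15
```

Flagging before updating keeps an outlier from masking itself; richer models (e.g., per-sensor dynamics) follow the same detect-then-update loop.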
Journal: CoRR
Volume: abs/1204.3677
Pages: -
Publication year: 2012